
Restructure Cosmos benchmark agent to 3 deterministic skills#48165

Draft
xinlian12 wants to merge 23 commits into Azure:main from xinlian12:cosmos-benchmark-agent

Conversation


@xinlian12 xinlian12 commented Feb 27, 2026

Cosmos Benchmark Agent

Overview

Adds a Copilot-powered Cosmos DB benchmark agent that automates the full benchmark lifecycle: provisioning infrastructure, running benchmarks, and analyzing results. The agent is organized into 3 deterministic, script-driven skills with a clear sequential workflow:

setup-resources → run → analyze

How to Use (Copilot CLI)

1. Select the agent

From the Copilot CLI, use @ to select the cosmos-benchmark agent:

$ @cosmos-benchmark setup resources for a benchmark run in West US 2

Alternatively, start a session while working under sdk/cosmos/azure-cosmos-benchmark/, and the agent is auto-selected based on context.

2. Example workflows

Full benchmark — provision, run, analyze:

You: setup resources with 50 cosmos accounts in West US 2
Agent: ✅ Created 50 accounts, 1 VM, exported config to ~/dev/benchmark-config

You: run benchmark on origin/main and xinlian12/wireConnectionSharingInBenchmark, simple preset, 10 min
Agent: ✅ Launched. Tmux running on VM. Polling...
Agent: ✅ origin/main completed. Starting next ref...
Agent: ✅ Both refs completed.

You: analyze results
Agent: 📊 Downloaded results. Generating comparison report...

Quick validation of a PR:

You: run benchmark on PR#12345 vs main, simple preset

Check on a running benchmark:

You: peek
Agent: ✅ Tmux running. 15 monitor samples. Threads: 358, Heap: 4GB/8GB, CPU: 18.7%

Reuse existing infrastructure:

You: setup resources, reuse existing cosmos accounts in rg-benchmark-west, VM at 20.98.84.14

3. Key commands

| What you say | What happens |
| --- | --- |
| `setup resources` | Provisions Cosmos accounts, App Insights, VM |
| `run benchmark on <refs>` | Builds JAR, generates tenants.json, runs on VM in tmux |
| `peek` / `check status` | Shows tmux state, monitor metrics, results status |
| `analyze results` | Downloads from VM, generates comparison report |
| `capture diagnostics` | Takes thread/heap dump of running benchmark |

Agent Structure

azure-cosmos-benchmark/
└── copilot/
    ├── agents/
    │   └── cosmos-benchmark.agent.md          # Routing table → 3 skills
    └── skills/
        ├── cosmos-benchmark-setup-resources/  # Step 1: Azure infrastructure
        ├── cosmos-benchmark-run/              # Step 2: Build & execute
        ├── cosmos-benchmark-analyze/          # Step 3: Download & report
        └── skill-creator/                     # Meta: skill authoring guide

Skills

1. setup-resources — Provision Azure Infrastructure

Creates Cosmos DB accounts, Application Insights, and Azure VMs. Exports credentials to a config directory consumed by downstream skills.

Script flow:

provision-all.sh                              # Entrypoint: orchestrates all provisioning
│
├── [1/5] validate-capacity.sh                # Pre-flight: check region has VM SKU + Cosmos capacity
│         └── Outputs capacity-check.json     #   Blocks provisioning if checks fail
│
├── [2/5] az group create                     # Create resource group
│
├── [3/5] Parallel resource creation ─────────────────────────────────────
│   ├── create-cosmos-accounts.sh  (bg)       # Creates N Cosmos DB accounts in parallel
│   ├── az monitor app-insights ... (bg)      # Creates Application Insights
│   └── provision-benchmark-vm.sh  (bg)       # Creates VM, installs JDK 21 + Maven 3.9
│       └── SSH → apt install, download JDK   #   Writes vm-ip, vm-user, vm-key to config-dir
│                                             # Waits for all 3 background jobs
│
├── [4/5] export-cosmos-credentials.sh        # Fetches account keys → clientHostAndKey.txt
│
└── [5/5] verify-resources.sh                 # Health check: SSH to VM, test Cosmos connectivity
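The parallel step [3/5] can be sketched as the standard bash background-job pattern: launch each creation step with `&`, record its PID, then `wait` on all of them and fail if any job failed. This is a minimal sketch, not the real provision-all.sh; the function bodies are stand-ins for the scripts named in the tree above.

```shell
#!/usr/bin/env bash
# Sketch of the parallel-provisioning pattern in provision-all.sh step [3/5].
# Each function is a stand-in for the corresponding script/command above.
set -euo pipefail

create_cosmos_accounts() { sleep 1; }   # stand-in for create-cosmos-accounts.sh
create_app_insights()    { sleep 1; }   # stand-in for az monitor app-insights ...
provision_vm()           { sleep 1; }   # stand-in for provision-benchmark-vm.sh

pids=()
create_cosmos_accounts & pids+=($!)
create_app_insights    & pids+=($!)
provision_vm           & pids+=($!)

status=0
for pid in "${pids[@]}"; do
  wait "$pid" || status=1               # collect the exit code of every job
done
[ "$status" -eq 0 ] && echo "all provisioning jobs succeeded"
```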

Config directory outputs (consumed by run skill):

<config-dir>/
├── vm-ip, vm-user, vm-key               # VM SSH connection info
├── vm-config.env                         # VM_IP, VM_USER, VM_KEY_PATH
├── clientHostAndKey.txt                  # Cosmos account endpoints + keys
├── app-insights-connection-string.txt    # Application Insights connection string
└── logs/                                 # Per-resource provisioning logs
    ├── capacity-check.json
    ├── cosmos-accounts.log
    ├── app-insights.log
    ├── vm.log
    └── export-credentials.log
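The handoff to the run skill works by sourcing these files. A minimal sketch of how a downstream script might consume vm-config.env (the variable names come from the listing above; the temp directory and ssh options are illustrative):

```shell
#!/usr/bin/env bash
# Sketch of the config-directory handoff: source vm-config.env and build
# the SSH invocation from it. The config-dir here is a throwaway stand-in.
set -euo pipefail

config_dir=$(mktemp -d)                 # stand-in for the real <config-dir>
cat > "$config_dir/vm-config.env" <<'EOF'
VM_IP=20.98.84.14
VM_USER=azureuser
VM_KEY_PATH=/home/azureuser/.ssh/bench_key
EOF

# shellcheck source=/dev/null
source "$config_dir/vm-config.env"

ssh_cmd="ssh -i $VM_KEY_PATH $VM_USER@$VM_IP"
echo "$ssh_cmd"
```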

2. run — Build & Execute Benchmarks

Clones repo at specified branch/PR/commit, builds the benchmark JAR, and executes scenarios on the VM inside a tmux session for resilience against SSH disconnections.

Script flow:

generate-tenants.sh                       # Generates tenants.json from config-dir credentials
  └── SCPs tenants.json to VM

run-all-refs.sh                           # Entrypoint: orchestrates N refs sequentially
│   (for each ref)
│   ├── SCP vm-prepare-and-run.sh → VM    # Copy bootstrapper to VM
│   ├── tmux new-session                  # Start tmux on VM (survives SSH drops)
│   │   └── vm-prepare-and-run.sh         # Runs ON the VM inside tmux
│   │       ├── git checkout <ref>        # Auto-detects branch/PR/commit/tag/fork
│   │       ├── mvn install (linting-extensions + benchmark JAR)
│   │       ├── Verify readiness (JDK, JAR, tenants.json, disk)
│   │       └── run-benchmark.sh          # Launches java benchmark process
│   │           ├── java -cp benchmark.jar Main -tenantsFile tenants.json ...
│   │           └── monitor.sh            # External JVM monitoring (threads, heap, FDs, GC)
│   └── Poll tmux until complete
└── Print summary (✅/❌ per ref)

check-status.sh                           # Standalone: check VM state anytime
  └── SSH → tmux status, results dirs, git state, build status, system resources

capture-diagnostics.sh                    # Standalone: capture thread/heap dumps mid-run
  └── SSH → jstack, jmap, JFR on running benchmark PID

Supports: multiple refs for comparison (main vs feature branch), scenario presets (SIMPLE ~30 min, EXPAND ~90 min, CHURN for leak detection), --force-copy-scripts to test local script changes.
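The launch-then-poll control flow above can be sketched as follows. This is an illustrative skeleton, not the real run-all-refs.sh: the SSH/tmux calls are shown as comments and replaced with local stand-ins so the loop is runnable anywhere.

```shell
#!/usr/bin/env bash
# Sketch of run-all-refs.sh's per-ref flow: launch detached, poll until the
# session exits, then read back an exit-code file.
set -euo pipefail

exit_code_file=$(mktemp)

launch_in_tmux() {
  # real flow (roughly): ssh "$VM" "tmux new-session -d -s bench \
  #   'bash ~/benchmark-scripts/vm-prepare-and-run.sh; echo \$? > /tmp/exit'"
  ( sleep 1; echo 0 > "$exit_code_file" ) &
}

session_alive() {
  # real flow (roughly): ssh "$VM" "tmux has-session -t bench"
  [ ! -s "$exit_code_file" ]
}

launch_in_tmux
while session_alive; do
  sleep 1                               # real flow polls every 2-5 min per preset
done
ref_exit=$(cat "$exit_code_file")
echo "benchmark exited with code $ref_exit"
```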

3. analyze — Download & Report Results

Downloads results from the VM and generates comparison reports with pass/fail thresholds.

Script flow:

download-results.sh                       # SCP results from VM → local
  └── results/<run-name>/
      ├── monitor.csv                     # External JVM metrics (threads, heap, FDs, GC)
      ├── metrics/                        # Codahale CSV metrics (latency, throughput)
      ├── gc.log                          # G1GC log
      ├── git-info.json                   # Branch, commit SHA
      └── heap-dumps/                     # If OOM or manually triggered

generate-report.py                        # Generates markdown report
  ├── Parse monitor.csv → time-series charts (thread count, heap, FDs)
  ├── Parse metrics/ → latency percentiles, throughput tables
  ├── Compare runs → side-by-side delta tables
  └── Apply thresholds → PASS/FAIL per metric (from references/thresholds.md)
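The threshold step can be illustrated with a one-liner over a monitor.csv-style file. This is an awk sketch, not generate-report.py; the column layout and the 400-thread limit are made up for illustration (real thresholds live in references/thresholds.md).

```shell
#!/usr/bin/env bash
# Sketch of a PASS/FAIL threshold check on peak thread count from a
# monitor.csv-style file. Columns and limit are illustrative.
set -euo pipefail

csv=$(mktemp)
cat > "$csv" <<'EOF'
timestamp,threads,heap_mb
1,320,4096
2,358,4300
3,341,4200
EOF

verdict=$(awk -F, 'NR > 1 && $2 > max { max = $2 }
                   END { print (max <= 400 ? "PASS" : "FAIL") " peak_threads=" max }' "$csv")
echo "$verdict"
```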

Benchmark Modes

The framework supports two modes — purely a configuration choice:

  • Single-tenant: Pass connection details directly via CLI flags
  • Multi-tenant: Pass -tenantsFile tenants.json with multiple account configurations

Both use the same JAR, orchestrator, and monitoring infrastructure.
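For multi-tenant mode, tenants.json is generated by generate-tenants.sh from the config-dir credentials. A hedged sketch of what such a file might look like (the field names here are illustrative, not the actual schema; the repo ships a sample template without real credentials):

```json
[
  {
    "endpoint": "https://bench-acct-001.documents.azure.com:443/",
    "key": "<redacted>",
    "databaseName": "benchdb"
  },
  {
    "endpoint": "https://bench-acct-002.documents.azure.com:443/",
    "key": "<redacted>",
    "databaseName": "benchdb"
  }
]
```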

Key Design Decisions

  • Scripts over code: All infrastructure and orchestration logic is in bash scripts, making it easy to run manually or debug
  • tmux resilience: Benchmarks run in tmux sessions on the VM, surviving SSH disconnections
  • Config directory pattern: Each skill reads/writes to a shared config directory, enabling clean handoff between steps
  • Force-copy-scripts flag: --force-copy-scripts overrides repo scripts with local versions for testing changes before they are merged
  • Mandatory post-launch verification: After launching, the agent must run check-status.sh to verify the benchmark is actually running before reporting success to the user
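The post-launch verification decision can be sketched as a retry loop with a 90-second budget. This is a stand-in skeleton: `check_status` here simulates check-status.sh over SSH, and the timings are illustrative.

```shell
#!/usr/bin/env bash
# Sketch of mandatory post-launch verification: probe status repeatedly
# within a 90s deadline before reporting success.
set -euo pipefail

state_file=$(mktemp)
( sleep 2; echo running > "$state_file" ) &      # stand-in: run comes up shortly after launch

check_status() { grep -q running "$state_file"; }  # stand-in for check-status.sh

deadline=$(( $(date +%s) + 90 ))
verified=no
until check_status; do
  if [ "$(date +%s)" -ge "$deadline" ]; then
    echo "launch NOT verified -- investigate before reporting success" >&2
    exit 1
  fi
  sleep 1
done
verified=yes
echo "benchmark verified running"
```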

Additional Changes

  • Adds skill-creator skill: a meta-skill for authoring new skills with proper structure and conventions

Annie Liang and others added 2 commits February 27, 2026 11:40
Add benchmark shell scripts for VM provisioning, setup, execution,
monitoring, diagnostics capture, and dashboard generation.

Update BenchmarkConfig, BenchmarkOrchestrator, and TenantWorkloadConfig
to support multi-tenant benchmark orchestration with per-tenant
configuration overrides.

Add .gitignore entries for benchmark artifacts and Copilot skills.
Add test-setup and test-results directory scaffolding with READMEs
and a sample tenants.json template (no real credentials).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Agent routing file dispatches to 5 skills covering the full
benchmark/DR drill lifecycle:

- provision: Cosmos DB accounts, App Insights, Azure VMs
- setup: JDK/Maven install, repo clone, config generation, build
- run: CHURN preset execution, multi-VM parallel, App Insights config
- analyze: CSV metrics, run comparison, heap/thread dumps, Kusto export
- status: resource health, run overview, App Insights verification

Also includes skill-creator utility for authoring new skills.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Annie Liang and others added 2 commits February 27, 2026 14:07
…ate runtime config

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@xinlian12 xinlian12 force-pushed the cosmos-benchmark-agent branch from bf46a2f to c43f5b6 on February 27, 2026 22:50
Consolidate the benchmark agent from 5 skills down to 3, with deterministic
script-driven flows replacing inline commands.

Skills:
- setup-resources: provision Azure infra (Cosmos DB, App Insights, VM) with
  parallel creation, capacity validation, region fallback, and verification gate
- run: clone/build/verify/execute benchmarks via single SSH session per ref,
  supports multiple refs for comparison, SIMPLE/EXPAND/CHURN presets
- analyze: download results to config-dir/results, generate markdown report
  with time-series SVG charts and multi-run comparison tables

Key changes:
- Rename provision -> setup-resources, merge setup into run, remove status
- .github/skills and .github/agents use symlinks to copilot/ (single source)
- Default region westus2, resource group rg-cosmos-benchmark-YYYYMMDD
- Config directory prompted with credential-in-repo warning
- provision-all.sh orchestrates parallel resource creation + verification
- vm-prepare-and-run.sh consolidates checkout/build/verify/run in 1 SSH session
- run-all-refs.sh loops over user-provided refs with per-ref result directories
- generate-report.py reads monitor.csv + metrics/*.csv, outputs report.md
- Remove parse_hprof.py, kusto-schema.md, generate-dashboard.py (deferred)
- Remove trigger-benchmark.sh (superseded by vm-prepare-and-run.sh)
- Merge setup-benchmark-vm.sh into provision-benchmark-vm.sh

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@xinlian12 xinlian12 changed the title from [Cosmos Benchmark]AddBenchmarkAgentAndSkills to Restructure Cosmos benchmark agent to 3 deterministic skills on Mar 2, 2026
Annie Liang and others added 18 commits March 2, 2026 15:00
- Add timestamped progress logging to validate-capacity.sh
- Fix restriction detection to handle all types (Zone, NotAvailableForSubscription)
- Replace slow per-SKU API calls with single-call alternative SKU search
- Add --find-alternatives flag to control similar SKU search
- Add restriction_reason field to JSON output
- Derive quota family dynamically from effective SKU

- Add --fallback-regions flag to find-region.sh for user-specified regions
- Implement 4-phase search: preferred exact → preferred similar → fallback exact → fallback similar
- Add [N/M] progress updates printed as each region completes
- Add --stop-on-first flag (default: true)
- Fix integration bugs: JSON path, exit code logic, stdin-based parsing

- Update SKILL.md to document new flags and search strategy

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Add capacity validation step before resource creation that blocks
  unless all checks pass (VM SKU, quota, Cosmos DB, App Insights)
- Add --skip-capacity-check flag to override the gate
- Add timestamped log() function for all progress messages
- Add elapsed time tracking per resource and total provisioning time
- Fix JSON parsing to match validate-capacity.sh output format
- Update SKILL.md to document new behavior and flag

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Wrap benchmark execution in tmux session ('bench') on VM so the
  process survives SSH disconnections
- Add async execution guidance to SKILL.md so the agent runs the
  orchestrator in background mode, keeping the user's context free
- Use scenario-based poll intervals (2min for SIMPLE, 5min for
  EXPAND/CHURN) instead of 10s fixed polling
- Expand monitoring section with local and VM-side status checks

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Detect refs like 'xinlian12/branchName' by checking if the part
  before the first slash matches an existing git remote
- If remote exists, fetch from that remote; otherwise treat the
  slash as part of the branch name on origin
- Document fork branch format in SKILL.md ref examples

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
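The detection logic in this commit can be sketched as a small helper: treat the text before the first `/` as a remote name only when `git remote` lists it. This is an illustrative function, not the exact code from vm-prepare-and-run.sh.

```shell
#!/usr/bin/env bash
# Sketch of fork-branch detection: 'xinlian12/branchName' resolves to
# remote 'xinlian12' only if that remote exists; otherwise the slash is
# part of a branch name on origin.
set -euo pipefail

resolve_ref() {
  local ref=$1 prefix=${1%%/*}
  if [ "$prefix" != "$ref" ] && git remote | grep -qx "$prefix"; then
    echo "remote=$prefix branch=${ref#*/}"
  else
    echo "remote=origin branch=$ref"
  fi
}
# e.g., in a repo with remote 'xinlian12':
#   resolve_ref xinlian12/wireConnectionSharingInBenchmark
#   resolve_ref feature/foo   # no 'feature' remote -> branch on origin
```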
- Instruct agent to proactively verify the run is progressing after
  async launch — if the shell exits too quickly, investigate
- Add diagnosis steps: check results dirs, git state, JAR, tmux
- Document common failures table (checkout, build, startup, SSH)
- Require confirming with user before relaunching after a failure

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- New script checks tmux session, results directories (with per-run
  status), git state, build status, and optionally system resources
- Supports --run-name for run-specific details (monitor samples,
  metrics, disk usage) and --verbose for system resource info
- Updated SKILL.md to reference check-status.sh in monitoring and
  troubleshooting sections
- Fix SSH stdin consumption in while-read loop with -n flag

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- SCP vm-prepare-and-run.sh, run-benchmark.sh, monitor.sh, and
  capture-diagnostics.sh to ~/benchmark-scripts/ on the VM
- Execute remotely via 'bash ~/benchmark-scripts/vm-prepare-and-run.sh'
  instead of 'bash -s' stdin piping which broke heredocs
- Update vm-prepare-and-run.sh to reference co-located scripts from
  ~/benchmark-scripts/ in the tmux run script

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- run-all-refs.sh now only SCPs vm-prepare-and-run.sh (the bootstrapper)
  instead of all 4 scripts
- After checkout, vm-prepare-and-run.sh resolves scripts from the
  cloned repo (copilot/skills/.../scripts/) so they match the ref
  being benchmarked
- Falls back to ~/benchmark-scripts/ if the repo doesn't include
  the scripts yet (e.g., older branches)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- run-all-refs.sh: --force-copy-scripts copies ALL scripts to VM
  (not just the bootstrapper) and passes --force-scripts to the
  bootstrapper
- vm-prepare-and-run.sh: --force-scripts overrides repo-first
  resolution, using ~/benchmark-scripts/ (the SCP'd copies) instead
- Default behavior unchanged: repo scripts used after checkout

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- run-all-refs.sh now starts vm-prepare-and-run.sh inside a tmux
  session, so checkout, build, verify AND run all survive SSH
  disconnection
- vm-prepare-and-run.sh Step 4 simplified: runs run-benchmark.sh
  directly (no nested tmux, no .run.sh heredoc generation)
- Polling and exit code logic moved to run-all-refs.sh orchestrator

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Write a small /tmp/bench-launch.sh on the VM that wraps
  vm-prepare-and-run.sh and writes the exit code
- Avoids nested quoting issues (SSH -> tmux -> bash -> args)
- Fix stale EXIT_CODE_FILE variable reference

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Use $HOME instead of ~ in double-quoted string to ensure correct
path expansion when interpolated into SSH commands.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
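The quoting pitfall behind this fix, in miniature: tilde expansion does not happen inside double quotes, so a remote path built as "~/benchmark-scripts" ships a literal tilde through the SSH command string, while "$HOME/benchmark-scripts" expands locally before interpolation.

```shell
#!/usr/bin/env bash
# Demonstrates why $HOME is used instead of ~ inside double-quoted strings
# that get interpolated into SSH commands.
set -euo pipefail

literal="~/benchmark-scripts"        # tilde is NOT expanded inside double quotes
expanded="$HOME/benchmark-scripts"   # expands now, before interpolation

echo "$literal"
echo "$expanded"
```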
When run-benchmark.sh is executed from ~/benchmark-scripts/ (SCP'd
copy), SCRIPT_DIR/../ doesn't point to the benchmark module. Fall
back to PWD if the script's parent doesn't contain a target/ dir.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…gDirectory

- --tenantsFile -> -tenantsFile (JCommander uses single dash)
- Remove --scenario and --outputDir (not valid Configuration params)
- Add -reportingDirectory for CSV metrics output

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Replace fire-and-forget async launch with a two-step workflow:
Step A: Launch orchestrator with sync mode (initial_wait: 60)
Step B: Mandatory verify via check-status.sh within 90s

Prevents the agent from telling the user 'it's running' without
actually confirming tmux is alive and results directory exists.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Previously, createBenchmarks() initialized Cosmos clients sequentially
in a for loop. With 50 tenants, each taking ~10-15s (connect + create
DB/container + populate docs), initialization alone took ~8-10 minutes.

Now submits all tenant initializations to the existing ExecutorService
in parallel, collecting results via Future.get(). With 50 tenants on
a 50-thread pool, initialization completes in ~15-20s instead.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>